2024-10-05 · 2024-11-01 · Quick notes

Multi-Head Attention

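DONE BERT 可解释性-从"头"说起 (BERT interpretability, starting from the "heads") - Zhihu [[BERT]]: repeatedly masks out parts of the structure and measures the effect on the metrics. [[2021/06/16]]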

[[香侬科技@为什么Transformer 需要进行 Multi-head Attention？]] (Shannon.AI: why does the Transformer need multi-head attention?)

Multiply $$X$$ by several sets of weight matrices $$W$$ to obtain several different sets of $$Q$$, $$K$$, $$V$$. Run self-attention on each set separately, then concat the resulting attention outputs.

$$\begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}\left(\text{head}_{1}, \ldots, \text{head}_{h}\right) W^{O} \\ \text{where } \text{head}_{i} &= \operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$$

$$W_{i}^{Q} \in \mathbb{R}^{d_{model} \times d_{k}}, \quad W_{i}^{K} \in \mathbb{R}^{d_{model} \times d_{k}}, \quad W_{i}^{V} \in \mathbb{R}^{d_{model} \times d_{v}}$$
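
For reference, a reminder that these notes do not spell out: in the original Transformer paper, the $$\operatorname{Attention}$$ function used by each head is scaled dot-product attention, and the output projection maps the concatenated heads back to $$d_{model}$$:

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V, \qquad W^{O} \in \mathbb{R}^{h d_{v} \times d_{model}}$$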

In the paper, each layer has $$h=8$$ attention heads.

The input vectors have size 512. To keep the output the same size, each head uses $$d_k=d_v=d_{model}/h=64$$.
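
The formulas and sizes above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the random weight initialization, and the sequence length `n = 10` are assumptions made for the example, and the $$1/\sqrt{d_k}$$ scaling comes from the original paper rather than from these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Q: (n_q, d_model); K, V: (n_k, d_model).
    W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    d_k = W_q.shape[-1]
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        q, k, v = Q @ W_qi, K @ W_ki, V @ W_vi   # per-head projections
        scores = q @ k.T / np.sqrt(d_k)          # (n_q, n_k)
        heads.append(softmax(scores) @ v)        # (n_q, d_v)
    # Concat the h heads along the feature axis, then project with W^O.
    return np.concatenate(heads, axis=-1) @ W_o  # (n_q, d_model)

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h                         # 64
n = 10                                           # hypothetical sequence length

X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k)) * 0.02  # illustrative random weights
W_k = rng.normal(size=(h, d_model, d_k)) * 0.02
W_v = rng.normal(size=(h, d_model, d_v)) * 0.02
W_o = rng.normal(size=(h * d_v, d_model)) * 0.02

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
print(out.shape)  # (10, 512)
```

Because $$h \cdot d_v = d_{model}$$, the concatenated heads already have width 512, so the output has the same shape as the input.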

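In principle, multi-head attention splits the full attention space into several attention subspaces while keeping the amount of computation roughly unchanged. Each head applies its own softmax, which introduces more nonlinearity and so strengthens the model's expressive power.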
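
A quick back-of-the-envelope check of the "roughly unchanged computation" claim, comparing one head of width $$d_{model}$$ against $$h$$ heads of width $$d_k$$. The sequence length `n` is a made-up value, and biases and the output projection $$W^O$$ are left out of the count.

```python
d_model, h = 512, 8
d_k = d_model // h
n = 100  # hypothetical sequence length

# Parameters in the Q, K, V projections (biases ignored).
single_head_params = 3 * d_model * d_model
multi_head_params = h * 3 * d_model * d_k

# Multiply-adds needed to form the score matrices Q K^T.
single_head_scores = n * n * d_model
multi_head_scores = h * n * n * d_k

print(single_head_params, multi_head_params)  # 786432 786432
print(single_head_scores, multi_head_scores)  # 5120000 5120000
```

The counts match because $$h \cdot d_k = d_{model}$$: the heads partition the same total width instead of adding to it.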

The paper uses multi-head attention in three ways:

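- encoder-decoder attention: the queries come from the output of the previous decoder layer; the keys and values come from the output of the last encoder layer.
  - Meaning: each decoder position looks up which encoder positions it is related to, and represents itself with the values at those encoder positions.
- encoder self-attention: query, key, and value all come from the output of the previous encoder layer. This lets each encoder position attend to all positions in the previous encoder layer.
- decoder masked self-attention: query, key, and value all come from the output of the previous decoder layer. This lets each decoder position attend to all positions in the previous decoder layer up to and including that position.

The first follows a QVV pattern (the query comes from one source, the keys and values from another); the latter two follow a VVV pattern (all three come from the same source).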
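
The three usages differ only in where the query and the keys/values come from, and in whether a causal mask is applied. A minimal single-head sketch under stated assumptions: the per-head projections are omitted for brevity, and the arrays `enc` and `dec`, their sizes, and the `attention` helper are hypothetical.

```python
import numpy as np

def attention(q, k, v, mask=None):
    # Single-head scaled dot-product attention.
    # mask[i, j] == True means query position i may attend to key position j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 64))  # hypothetical encoder states, 6 source positions
dec = rng.normal(size=(4, 64))  # hypothetical decoder states, 4 target positions

# encoder self-attention (VVV): everything comes from the encoder.
_, w_enc = attention(enc, enc, enc)

# decoder masked self-attention (VVV): lower-triangular mask hides later positions.
causal = np.tril(np.ones((4, 4), dtype=bool))
_, w_dec = attention(dec, dec, dec, mask=causal)

# encoder-decoder attention (QVV): query from the decoder, keys/values from the encoder.
out_cross, w_cross = attention(dec, enc, enc)

print(w_enc.shape, w_dec.shape, w_cross.shape)  # (6, 6) (4, 4) (4, 6)
print(np.allclose(np.triu(w_dec, k=1), 0.0))    # True
```

The last line checks that, with the mask, no decoder position puts weight on a later position.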


Ryen Xiang
